Training Configuration

Overview

OpenCLIP provides extensive configuration options for training CLIP models. This page documents all important training flags and hyperparameters from params.py. To see all available options:

python -m open_clip_train.main --help

Data Configuration

Training Data

--train-data

string

Path to training data. For WebDataset, use brace-expansion patterns like /data/train-{0000..2175}.tar to list shards. Multiple sources can be combined with ::.

--train-data "/data/cc12m/train-{0000..2175}.tar"
--train-data "/data/cc12m/train.tar::/data/laion/train.tar"  # Multiple sources
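To make the pattern syntax concrete, the sketch below splits a --train-data value on :: and expands one zero-padded numeric {start..end} range per source. The helper names expand_numeric_range and expand_train_data are hypothetical, and this only approximates shell-style brace expansion; it is not the loader's own code.

```python
import re

# Matches one numeric range such as {0000..2175}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")

def expand_numeric_range(pattern: str) -> list[str]:
    """Hypothetical helper: expand a single zero-padded numeric range."""
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]  # no range present: treat as a single shard path
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep the zero padding of the start bound
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(int(start), int(end) + 1)]

def expand_train_data(spec: str) -> list[list[str]]:
    """Return one list of shard paths per '::'-separated source."""
    return [expand_numeric_range(source) for source in spec.split("::")]

sources = expand_train_data("/data/cc12m/train-{0000..2175}.tar::/data/laion/train.tar")
print(len(sources), "sources")
print(len(sources[0]), "shards, first:", sources[0][0], "last:", sources[0][-1])
print(sources[1])
```

Counting the expanded shards this way is a quick sanity check that a pattern covers the files you expect before launching a long run.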

--val-data

string

Path to validation data (same format as train-data).

--val-data "/data/val.csv"

--train-num-samples

integer

Total number of samples in training dataset. Required for WebDataset.

--train-num-samples 10968539  # CC12M

--val-num-samples

integer

Number of samples in validation dataset.

--dataset-type

string

default:"auto"

Dataset format: webdataset, csv, synthetic, or auto (auto-detect).

--dataset-type webdataset

--dataset-resampled

boolean

Enable sampling with replacement for webdataset. Recommended for large datasets and multiple data sources.

--dataset-resampled

CSV Data Parameters

--csv-separator

string

default:"\t"

Column separator for CSV files (tab by default).

--csv-separator ","  # Use comma separator

--csv-img-key

string

default:"filepath"

Column name for image paths in CSV.

--csv-img-key filepath

--csv-caption-key

string

default:"title"

Column name for captions in CSV.

--csv-caption-key title

Data Upsampling

--train-data-upsampling-factors

string

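Upsampling factors for multiple data sources, separated by :: and given in the same order as the sources in --train-data. Controls relative sampling probability. Only valid together with --dataset-resampled.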

--train-data "/data/cc12m/train.tar::/data/cc3m/train.tar" \
--train-data-upsampling-factors "1::4"  # Sample CC3M 4x more frequently
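One way to read "relative sampling probability" is shard-level weighted sampling with replacement, where every shard inherits the factor of its source. The sketch below assumes exactly that model; the shard counts and the helper name source_probabilities are made up for illustration.

```python
import random
from collections import Counter

def source_probabilities(shards: dict[str, int], factors: dict[str, float]) -> dict[str, float]:
    """Probability that a drawn shard comes from each source, assuming each
    shard carries its source's factor as its sampling weight."""
    mass = {name: count * factors[name] for name, count in shards.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}

# Hypothetical shard counts for two sources of different size.
shards = {"cc12m": 100, "cc3m": 25}
factors = {"cc12m": 1.0, "cc3m": 4.0}  # mirrors "1::4"

probs = source_probabilities(shards, factors)
print(probs)  # cc3m has 4x fewer shards but a 4x factor, so the sources balance

# Empirical check using sampling with replacement.
rng = random.Random(0)
draws = rng.choices(list(probs), weights=list(probs.values()), k=10_000)
print(Counter(draws))
```

Under this model a factor compensates for a source having fewer shards, so factors may need adjusting when shard counts or shard sizes differ a lot between sources.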

Model Configuration

Model Selection

--model

string

default:"RN50"

Model architecture to train. See Model Architectures for all options.

--model ViT-B-32
--model ViT-L-14
--model RN50
--model coca_ViT-L-14  # CoCa model

--pretrained

string

Load pretrained weights. Can be a tag (e.g., laion2b_s34b_b79k) or a local path.

--pretrained laion2b_s34b_b79k
--pretrained /path/to/checkpoint.pt

--pretrained-image

boolean

Load ImageNet pretrained weights for the image encoder (if available).

--pretrained-image

Model Modifications

--force-image-size

integer

Override default image input size.

--force-image-size 224
--force-image-size 336 336  # Explicit height and width

--force-context-length

integer

Override default text context length.

--force-context-length 77

--force-patch-dropout

float

Override patch dropout probability for ViT models. Dropping 50-75% of the patch tokens (values of 0.5-0.75) can speed up training by up to 2-3x.

--force-patch-dropout 0.5  # 50% patch dropout
--force-patch-dropout 0.0  # Disable patch dropout (fine-tuning)
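The speedup comes from the image tower processing fewer tokens per image. The sketch below estimates how many patch tokens survive for a given dropout value. It assumes a square input, non-overlapping patches, a class token that is not counted, and rounding down with at least one token kept; tokens_kept is a hypothetical helper.

```python
def tokens_kept(image_size: int, patch_size: int, dropout: float) -> int:
    """Patch tokens kept per image for a given patch dropout probability."""
    num_patches = (image_size // patch_size) ** 2
    # Assumption: round down, but never drop every token.
    return max(1, int(num_patches * (1.0 - dropout)))

# ViT-L/14 at 224x224 has a 16x16 grid of patches.
for p in (0.0, 0.5, 0.75):
    print(f"dropout={p}: {tokens_kept(224, 14, p)} of {tokens_kept(224, 14, 0.0)} tokens")
```

Because self-attention cost grows faster than linearly in the token count, halving the tokens can reduce compute by more than half, which is consistent with the quoted speedup range.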

--force-quick-gelu

boolean

Force QuickGELU activation (for compatibility with older checkpoints).

--force-custom-text

boolean

Force separate text tower (CustomTextCLIP architecture).

Training Hyperparameters

Batch Size and Epochs

--batch-size

integer

default:"64"

Batch size per GPU. Total batch size = batch_size × num_gpus × accum_freq.

--batch-size 256

--epochs

integer

default:"32"

Number of training epochs.

--epochs 32

--accum-freq

integer

default:"1"

Gradient accumulation frequency. Simulates larger batch sizes.

--accum-freq 4  # Effective batch = batch_size × 4
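The total batch size formula above can be checked with a few lines of arithmetic. This sketch also estimates optimizer steps per epoch, assuming one optimizer step per total batch and a partial final batch rounded up; both helper names are hypothetical.

```python
import math

def total_batch_size(batch_size: int, num_gpus: int, accum_freq: int = 1) -> int:
    """Total batch size = batch_size x num_gpus x accum_freq."""
    return batch_size * num_gpus * accum_freq

def optimizer_steps_per_epoch(num_samples: int, batch_size: int,
                              num_gpus: int, accum_freq: int = 1) -> int:
    # Assumption: one optimizer step per total batch, last partial batch rounded up.
    return math.ceil(num_samples / total_batch_size(batch_size, num_gpus, accum_freq))

# 8 GPUs, 256 samples per GPU, 4 accumulation steps, CC12M sample count.
print(total_batch_size(256, 8, 4))
print(optimizer_steps_per_epoch(10_968_539, 256, 8, 4))
```

An estimate like this helps when choosing --warmup, since warmup is expressed in steps rather than epochs.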

Learning Rate

--lr

float

Learning rate. Default: 5e-4 for both ViT and ResNet models.

--lr 1e-3
--lr 5e-4

--warmup

integer

default:"10000"

Number of warmup steps (linear warmup from 0 to lr).

--warmup 10000

--lr-scheduler

string

default:"cosine"

Learning rate schedule: cosine, const, or const-cooldown.

--lr-scheduler cosine
--lr-scheduler const  # Constant LR after warmup
--lr-scheduler const-cooldown  # Constant with cooldown
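As an illustration of how --warmup and the cosine schedule interact, here is a minimal sketch of linear warmup followed by cosine decay to zero. It is a common formulation written from the descriptions above, not a copy of the scheduler code, and lr_at_step is a hypothetical name.

```python
import math

def lr_at_step(step: int, base_lr: float, warmup: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / (total_steps - warmup)  # 0 right after warmup
    return 0.5 * (1.0 + math.cos(math.pi * progress)) * base_lr

base_lr, warmup, total = 5e-4, 10_000, 100_000
for step in (0, 4_999, 9_999, 10_000, 55_000, 99_999):
    print(step, f"{lr_at_step(step, base_lr, warmup, total):.3e}")
```

With the const scheduler the second branch would simply return base_lr, so the warmup portion is the same in both cases.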

--epochs-cooldown

integer

Number of cooldown epochs for const-cooldown scheduler.

--lr-scheduler const-cooldown \
--epochs-cooldown 5

--lr-cooldown-end

float

default:"0.0"

End learning rate for cooldown.

--lr-cooldown-end 1e-6

--lr-cooldown-power

float

default:"1.0"

Power for polynomial cooldown (1.0 = linear).

--lr-cooldown-power 1.0
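The three cooldown flags combine into a schedule with a warmup, a constant plateau, and a polynomial decay to the end learning rate. The sketch below works in steps and assumes the cooldown epochs have already been converted to steps; const_cooldown_lr is a hypothetical name and the exact formula is an assumption based on the flag descriptions.

```python
def const_cooldown_lr(step: int, base_lr: float, warmup: int, total_steps: int,
                      cooldown_steps: int, cooldown_end: float = 0.0,
                      cooldown_power: float = 1.0) -> float:
    """Warmup, constant plateau, then polynomial decay to cooldown_end."""
    cooldown_start = total_steps - cooldown_steps
    if step < warmup:
        return base_lr * (step + 1) / warmup
    if step < cooldown_start:
        return base_lr
    frac = (step - cooldown_start) / cooldown_steps  # 0 at cooldown start
    decay = (1.0 - frac) ** cooldown_power           # power 1.0 gives a linear ramp
    return decay * (base_lr - cooldown_end) + cooldown_end

for step in (500, 5_000, 8_000, 9_000, 9_999):
    lr = const_cooldown_lr(step, 5e-4, 1_000, 10_000, 2_000, cooldown_end=1e-6)
    print(step, f"{lr:.4e}")
```

A power above 1.0 drops the learning rate faster at the start of the cooldown and flattens out near the end.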

Optimizer

--opt

string

default:"adamw"

Optimizer choice. Use adamw or timm/{optimizer} for timm optimizers.

--opt adamw
--opt timm/sgd

--beta1

float

Adam beta1 parameter. Default: 0.9 for both ViT and ResNet models.

--beta1 0.9

--beta2

float

Adam beta2 parameter. Default:

ViT: 0.98
ResNet: 0.999

--beta2 0.98

--eps

float

Adam epsilon parameter. Default:

ViT: 1e-6
ResNet: 1e-8

--eps 1e-6
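The per-model defaults for --lr, --beta1, --beta2, and --eps listed above can be summarized as a small lookup. The sketch assumes the model family is chosen by a case-insensitive "vit" substring match on the model name; default_optimizer_params is a hypothetical helper, and only the values come from this page.

```python
def default_optimizer_params(model_name: str) -> dict[str, float]:
    """Defaults from this page, selected by model family (assumed substring match)."""
    if "vit" in model_name.lower():
        return {"lr": 5e-4, "beta1": 0.9, "beta2": 0.98, "eps": 1e-6}
    return {"lr": 5e-4, "beta1": 0.9, "beta2": 0.999, "eps": 1e-8}

for name in ("ViT-B-32", "coca_ViT-L-14", "RN50"):
    print(name, default_optimizer_params(name))
```

Any of these flags passed explicitly on the command line takes the place of the corresponding default.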

--wd

float

default:"0.2"

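Weight decay coefficient. With the default adamw optimizer this is decoupled weight decay, applied directly to the weights rather than as an L2 penalty added to the loss.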

--wd 0.2
--wd 0.1

--momentum

float

Momentum for timm optimizers (SGD, etc.).

--momentum 0.9

Gradient Clipping

--grad-clip-norm

float

Gradient clipping norm. Prevents gradient explosion.

--grad-clip-norm 1.0
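Norm-based clipping rescales all gradients by a single factor whenever their combined L2 norm exceeds the threshold. The pure-Python sketch below shows the arithmetic on plain lists; it illustrates the technique and is not the training loop's actual call. The small eps term is an assumption that mirrors common implementations.

```python
import math

def clip_by_global_norm(grads: list[list[float]], max_norm: float, eps: float = 1e-6):
    """Scale every gradient by the same factor so the global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]  # two "parameter tensors"; global norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(norm_before, norm_after)
```

Because one factor is applied everywhere, the direction of the update is preserved and only its length shrinks.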

Precision and Memory

Precision

--precision

string

default:"amp"

Training precision: amp, amp_bf16, bf16, fp16, fp32.

--precision amp        # Automatic Mixed Precision (FP16) - Recommended
--precision amp_bf16   # AMP with BFloat16 (A100/H100)
--precision fp32       # Full precision (slow, baseline)

Memory Optimization

--grad-checkpointing

boolean

Enable gradient checkpointing to reduce memory usage (slower training).

--grad-checkpointing

--local-loss

boolean

Calculate loss with local features @ global features, instead of materializing the full global @ global logit matrix on every GPU (reduces memory from O(n²) to O(n)).

--local-loss

--gather-with-grad

boolean

Enable gradient flow through feature gathering (use with --local-loss).

--gather-with-grad

Always use --local-loss and --gather-with-grad together for multi-GPU training (8+ GPUs). See Distributed Training.
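To see why the local loss matters at scale, compare the number of logit entries a single GPU holds in each mode. This sketch is one reading of the O(n²) versus O(n) statement above: it counts one image-to-text logit matrix per GPU and ignores everything else in memory. logits_per_gpu is a hypothetical helper.

```python
def logits_per_gpu(batch_size: int, world_size: int, local_loss: bool) -> int:
    """Entries in one image-to-text logit matrix held on a single GPU."""
    global_batch = batch_size * world_size
    rows = batch_size if local_loss else global_batch  # local rows vs. all rows
    return rows * global_batch

for gpus in (8, 64):
    full = logits_per_gpu(256, gpus, local_loss=False)
    local = logits_per_gpu(256, gpus, local_loss=True)
    print(f"{gpus} GPUs: global={full:,} local={local:,} ratio={full // local}x")
```

The saving grows with the number of GPUs, which is why the recommendation targets larger multi-GPU runs.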

Data Loading

--workers

integer

default:"4"

Number of data loading workers per GPU.

--workers 8  # 8 workers per GPU

Recommended: 4-8 workers per GPU for optimal performance.

Image Preprocessing

--image-mean

float[]

Override image normalization mean (RGB).

--image-mean 0.485 0.456 0.406  # ImageNet statistics

--image-std

float[]

Override image normalization std (RGB).

--image-std 0.229 0.224 0.225  # ImageNet statistics
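The mean and std values are applied per channel as (value - mean) / std. A minimal sketch on a single RGB pixel, assuming channel values already scaled to the range 0 to 1; normalize_pixel is a hypothetical helper.

```python
def normalize_pixel(rgb, mean, std):
    """Normalize one RGB pixel, channel by channel."""
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# A pixel equal to the mean maps to zero in every channel.
print(normalize_pixel(IMAGENET_MEAN, IMAGENET_MEAN, IMAGENET_STD))
# A pure white pixel ends up a little over two standard deviations above the mean.
print(normalize_pixel((1.0, 1.0, 1.0), IMAGENET_MEAN, IMAGENET_STD))
```

When fine-tuning from pretrained weights, the statistics should match the ones the checkpoint was trained with, since the model's first layer expects inputs in that range.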

--image-interpolation

string

Image resize interpolation: bicubic, bilinear, or random.

--image-interpolation bicubic

--image-resize-mode

string

Image resize mode: shortest, longest, or squash (inference only).

--image-resize-mode shortest

--aug-cfg

key=value

Data augmentation configuration (key-value pairs).

--aug-cfg scale='(0.08, 1.0)' ratio='(0.75, 1.33)'
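A key-value flag like this is typically handled by a custom argparse action that turns each key=value token into a dictionary entry. The sketch below assumes values are parsed as Python literals with a fallback to the raw string. ParseKeyValue is a hypothetical class, and the keys in the example are placeholders rather than a list of supported options.

```python
import argparse
import ast

class ParseKeyValue(argparse.Action):
    """Hypothetical argparse action: collect key=value tokens into a dict."""

    def __call__(self, parser, namespace, values, option_string=None):
        parsed = {}
        for item in values:
            key, _, raw = item.partition("=")
            try:
                parsed[key] = ast.literal_eval(raw)  # numbers, tuples, booleans
            except (ValueError, SyntaxError):
                parsed[key] = raw                    # anything else stays a string
        setattr(namespace, self.dest, parsed)

parser = argparse.ArgumentParser()
parser.add_argument("--aug-cfg", nargs="*", default={}, action=ParseKeyValue)

# Placeholder keys; each list element is one shell token.
args = parser.parse_args(["--aug-cfg", "scale=(0.4, 1.0)", "use_timm=True", "tag=demo"])
print(args.aug_cfg)
```

Values containing spaces or parentheses, such as tuples, need quoting in the shell so that each key=value pair arrives as a single token.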

Model Locking (Transfer Learning)

Image Tower

--lock-image

boolean

Lock (freeze) entire image encoder.

--lock-image

--lock-image-unlocked-groups

integer

default:"0"

Leave last N image tower layer groups unlocked.

--lock-image --lock-image-unlocked-groups 2  # Freeze all but last 2 groups

--lock-image-freeze-bn-stats

boolean

Freeze BatchNorm running statistics in locked layers.

--lock-image-freeze-bn-stats

Text Tower

--lock-text

boolean

Lock (freeze) entire text encoder.

--lock-text

--lock-text-unlocked-layers

integer

default:"0"

Leave last N text tower layers unlocked.

--lock-text --lock-text-unlocked-layers 10  # Train last 10 layers

--lock-text-freeze-layer-norm

boolean

Freeze LayerNorm in locked text layers.

--lock-text-freeze-layer-norm

Checkpointing and Logging

Checkpoints

--save-frequency

integer

default:"1"

Save checkpoint every N epochs.

--save-frequency 1  # Save every epoch
--save-frequency 5  # Save every 5 epochs

--save-most-recent

boolean

Save most recent checkpoint as epoch_latest.pt.

--save-most-recent

--delete-previous-checkpoint

boolean

Delete previous checkpoint after saving new one (saves disk space).

--delete-previous-checkpoint

--resume

string

Resume training from checkpoint path or "latest".

--resume /path/to/checkpoint.pt
--resume latest  # Resume from latest checkpoint

Logging

--logs

string

default:"./logs/"

Directory for logs and checkpoints.

--logs ./logs/

--name

string

Experiment name (defaults to auto-generated based on timestamp and config).

--name "vit-b32-cc12m-experiment"

--report-to

string

Logging backends: tensorboard, wandb, or tensorboard,wandb.

--report-to tensorboard
--report-to wandb
--report-to tensorboard,wandb  # Both

--log-every-n-steps

integer

default:"100"

Log training metrics every N steps.

--log-every-n-steps 100

Weights & Biases

--wandb-project-name

string

default:"open-clip"

W&B project name.

--wandb-project-name "my-clip-experiments"

--wandb-notes

string

Notes for W&B run.

--wandb-notes "Testing new learning rate schedule"

Evaluation

--imagenet-val

string

Path to ImageNet validation set for zero-shot evaluation during training.

--imagenet-val /data/imagenet/validation/

--imagenet-v2

string

Path to ImageNet-v2 for additional zero-shot evaluation.

--imagenet-v2 /data/imagenet-v2/

--zeroshot-frequency

integer

default:"2"

Run zero-shot evaluation every N epochs.

--zeroshot-frequency 1  # Every epoch

--val-frequency

integer

default:"1"

Run validation every N epochs.

--val-frequency 1
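The two frequency flags decide after which epochs an evaluation runs. The sketch below lists those epochs for a given setting. It assumes epochs are counted from 1 after completion and that the final epoch is always evaluated even when it is not a multiple of the frequency; that last rule is an assumption, and eval_epochs is a hypothetical helper.

```python
def eval_epochs(total_epochs: int, frequency: int) -> list[int]:
    """Completed epochs after which a periodic evaluation would run."""
    return [
        epoch
        for epoch in range(1, total_epochs + 1)
        # Assumption: always evaluate after the last epoch.
        if epoch % frequency == 0 or epoch == total_epochs
    ]

print(eval_epochs(7, 2))
print(eval_epochs(5, 1))
```

A higher frequency value trades fewer accuracy data points for less time spent outside the training loop.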

CoCa-Specific Parameters

--coca-contrastive-loss-weight

float

default:"1.0"

Weight for CoCa contrastive loss.

--coca-contrastive-loss-weight 1.0

--coca-caption-loss-weight

float

default:"2.0"

Weight for CoCa caption generation loss.

--coca-caption-loss-weight 2.0

For CoCa fine-tuning on captioning only:

--coca-contrastive-loss-weight 0 \
--coca-caption-loss-weight 1
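The two weights scale the contrastive and captioning objectives before they are combined. A minimal sketch of that combination, assuming the total is a plain weighted sum; coca_total_loss is a hypothetical helper and the loss values are made up.

```python
def coca_total_loss(contrastive: float, caption: float,
                    contrastive_weight: float = 1.0,
                    caption_weight: float = 2.0) -> float:
    """Weighted sum of the contrastive and caption losses."""
    return contrastive_weight * contrastive + caption_weight * caption

# Default weights: 1.0 * contrastive + 2.0 * caption.
print(coca_total_loss(0.8, 1.5))
# Captioning-only fine-tuning: the contrastive term is switched off.
print(coca_total_loss(0.8, 1.5, contrastive_weight=0.0, caption_weight=1.0))
```

Setting the contrastive weight to 0 removes that term from the total, which matches the captioning-only configuration shown above.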

Distributed Training

--dist-url

string

URL for distributed training initialization.

--dist-url tcp://localhost:12345

--dist-backend

string

Distributed backend: nccl (NVIDIA GPU), hccl (Ascend NPU), or gloo (CPU).

--dist-backend nccl  # Default for GPU

--horovod

boolean

Use Horovod for distributed training.

--horovod

--ddp-static-graph

boolean

Enable static graph optimization for DDP (PyTorch >= 1.11).

--ddp-static-graph

--use-bn-sync

boolean

Use synchronized batch normalization across GPUs.

--use-bn-sync

Advanced Options

Compilation

--torchcompile

boolean

Compile model with torch.compile() (PyTorch >= 2.0).

--torchcompile

--torchscript

boolean

TorchScript the model.

--torchscript

--trace

boolean

Trace model with torch.jit.trace (inference only).

--trace

Model Distillation

--distill-model

string

Teacher model architecture for distillation.

--distill-model ViT-L-14

--distill-pretrained

string

Teacher model pretrained weights.

--distill-pretrained openai

Loss Configuration

--siglip

boolean

Use SigLip (sigmoid) loss instead of standard CLIP loss.

--siglip

--loss-dist-impl

string

Distributed loss implementation override.

--loss-dist-impl custom

Remote Syncing

--remote-sync

string

Remote path to sync checkpoints (S3 bucket or filesystem).

--remote-sync s3://my-bucket/checkpoints

--remote-sync-frequency

integer

default:"300"

Sync to remote every N seconds.

--remote-sync-frequency 600  # Sync every 10 minutes

--remote-sync-protocol

string

default:"s3"

Protocol for remote sync: s3 or fsspec.

--remote-sync-protocol s3

Experimental

--use-bnb-linear

string

Use bitsandbytes linear layers for int8 training (experimental).

--use-bnb-linear SwitchBackLinearGlobal

Other

--seed

integer

default:"0"

Random seed for reproducibility.

--seed 42

--device

string

default:"cuda"

Device for training: cuda or cpu.

--device cuda

--cache-dir

string

Override default cache directory for model/tokenizer downloads.

--cache-dir /path/to/cache

--debug

boolean

Enable debug logging.

--debug

--log-local

boolean

Log on local master (each node) instead of global master only.

--log-local

--copy-codebase

boolean

Copy entire codebase to log directory.

--copy-codebase

Example Configurations

Small-Scale Training (RN50 on CC3M)

python -m open_clip_train.main \
    --train-data "/data/cc3m/train.csv" \
    --dataset-type csv \
    --csv-img-key filepath \
    --csv-caption-key title \
    --batch-size 256 \
    --precision amp \
    --workers 4 \
    --warmup 2000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 30 \
    --model RN50 \
    --save-frequency 5 \
    --report-to tensorboard

Medium-Scale Training (ViT-B/32 on CC12M)

torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data "/data/cc12m/cc12m-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 320 \
    --precision amp \
    --workers 6 \
    --imagenet-val /data/imagenet/validation/ \
    --warmup 10000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --model ViT-B-32 \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --local-loss \
    --gather-with-grad \
    --report-to wandb

Large-Scale Training (ViT-L/14 on LAION-400M)

srun python -u src/open_clip_train/main.py \
    --train-data "/data/laion400m/{00000..41455}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 128 \
    --precision amp \
    --grad-checkpointing \
    --workers 8 \
    --warmup 10000 \
    --lr 5e-4 \
    --wd 0.2 \
    --epochs 32 \
    --model ViT-L-14 \
    --save-frequency 1 \
    --zeroshot-frequency 2 \
    --local-loss \
    --gather-with-grad \
    --force-patch-dropout 0.5 \
    --report-to wandb \
    --remote-sync s3://bucket/checkpoints \
    --delete-previous-checkpoint

Recommended Settings by Model

ViT-B/32

Recommended per-GPU batch size: 256-512. --batch-size takes a single value; 256 is used here.

--model ViT-B-32 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.98 \
--eps 1e-6 \
--batch-size 256 \
--precision amp

ViT-L/14

Recommended per-GPU batch size: 128-256. --batch-size takes a single value; 128 is used here.

--model ViT-L-14 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.98 \
--eps 1e-6 \
--batch-size 128 \
--precision amp \
--grad-checkpointing \
--force-patch-dropout 0.5

RN50

Recommended per-GPU batch size: 256-512. --batch-size takes a single value; 256 is used here.

--model RN50 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.999 \
--eps 1e-8 \
--batch-size 256 \
--precision amp

Next Steps

Single-Node Training

Apply these configurations to single-node training

Distributed Training

Configure distributed training optimizations

Data Preparation

Configure data loading and preprocessing

Fine-tuning

Configure fine-tuning from pretrained models
